[LinearLayouts] Faster pext algorithm #5621

lezcano · 2025-01-15T16:16:10Z

We also skip the LinearLayout test for HIP as it's currently failing.

Regarding the use of `getWarpSize` and `getNumWarpsPerCTA`, which are not correct for LinearLayouts with broadcasting, as noted in #5617: we found that almost all of their uses are in AMD code. Switching these call sites to the functions that act on the module is tricky, because the module is not currently accessible at most of them. As such, we leave this refactor to the AMD folks.


lezcano · 2025-01-15T16:42:41Z

python/test/unit/language/test_core.py

+    if is_hip() and isinstance(src_layout, LinearLayout):
+        pytest.skip("FIXME: LinearLayout not supported on HIP")


cc @antiagainst regarding the HIP skip. See also the warpSize / numWarp comment in the OP.

This PR fixes a typo in the Windows implementation of `__builtin_clz` that was introduced in #5621. According to [this in-code comment](https://github.com/triton-lang/triton/blob/b3dcc32f387d1d54ccd6cbbbc087296c0539e703/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L12), these Windows implementations should have been copied from [this gist snippet](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0). In the snippet, however, the `clz` implementation additionally [XORs the result of `_BitScanReverse`](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L51-L53) to convert the index of the most significant set bit produced by `_BitScanReverse` into the expected number of leading zeros. I believe the implementation was copied into Triton without the final XOR by accident.

What is affected by this error? This implementation of CLZ is used in [`pext_i32`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L635), which is used in [`delinearize`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/Utility.cpp#L662), which is used by the [`ReduceOpToLLVM`](https://github.com/intel/intel-xpu-backend-for-triton/blob/4a9967137548f8fe9b1a93383e4fd12646352231/lib/Conversion/TritonGPUToLLVM/ReduceOpToLLVM.cpp#L243-L247) pattern. This bug caused `tt.reduce()` ops to be lowered incorrectly on Windows in cases where shared memory is needed to store temporary reduced results.

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
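For illustration, here is a minimal Python sketch of why the final XOR matters. It is not the Triton or gist code: `bit_scan_reverse`, `clz32`, and `clz32_missing_xor` are hypothetical names, and `bit_scan_reverse` only emulates the index that `_BitScanReverse` reports for a nonzero 32-bit input. For an index in the range 0 to 31, `index ^ 31` equals `31 - index`, which is the leading-zero count.

```python
def bit_scan_reverse(x: int) -> int:
    """Emulate the index reported by _BitScanReverse: the position of the
    highest set bit of a nonzero 32-bit value (hypothetical helper)."""
    if not 0 < x < 2**32:
        raise ValueError("expected a nonzero 32-bit unsigned value")
    return x.bit_length() - 1


def clz32(x: int) -> int:
    """Leading-zero count: the bit index XORed with 31 (same as 31 - index)."""
    return bit_scan_reverse(x) ^ 31


def clz32_missing_xor(x: int) -> int:
    """What a version without the final XOR returns: the bit index itself."""
    return bit_scan_reverse(x)


if __name__ == "__main__":
    for value in (1, 0x00010000, 0x80000000):
        print(hex(value), clz32(value), clz32_missing_xor(value))
```

The two functions only agree when the index and the leading-zero count happen to coincide, which cannot occur for a 32-bit value because `index + clz` is always 31.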

Closes #3273

A recent upstream PR (triton-lang/triton#5621) introduced new, faster logic for `pext_i32`, which is used in the `ReduceOpToLLVM` pattern. The new logic of `pext_i32` uses the `__builtin_clz` intrinsic, which is natively available in GCC and Clang but missing in MSVC. It seems that the Windows version of this intrinsic was incorrectly copied from [the given source](https://gist.github.com/pps83/3210a2f980fd02bb2ba2e5a1fc4a2ef0#file-ctz_clz-cpp-L44-L55): it is missing the `r ^ 31` at the end, which causes the `tt.reduce(...)` lowering to produce incorrect LLVM IR in some scenarios.

Signed-off-by: dchigarev <dmitry.chigarev@intel.com>
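For readers unfamiliar with the operation, the following is a reference sketch of what a 32-bit parallel bit extract (pext) computes, assuming the usual semantics of the x86 BMI2 `PEXT` instruction. It is a plain bit-by-bit loop and not the faster `__builtin_clz`-based algorithm from triton-lang/triton#5621; the name `pext_i32_reference` is hypothetical.

```python
def pext_i32_reference(value: int, mask: int) -> int:
    """Gather the bits of `value` selected by `mask` and pack them,
    in order, into the low bits of the result (reference semantics only)."""
    value &= 0xFFFFFFFF
    mask &= 0xFFFFFFFF
    result = 0
    out_bit = 0
    while mask:
        lowest = mask & -mask          # isolate the lowest set bit of the mask
        if value & lowest:
            result |= 1 << out_bit     # copy that bit into the next output slot
        out_bit += 1
        mask &= mask - 1               # clear the lowest set bit of the mask
    return result


if __name__ == "__main__":
    print(hex(pext_i32_reference(0xABCD, 0xFF00)))
```

A slow reference like this can serve as an oracle when checking a faster implementation on a given platform.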


lezcano added 2 commits January 15, 2025 16:15

[LinearLayouts] Faster pext algorithm

d3d1d32

We also skip the LinearLayout test for HIP as it's currently failing

Skip HIP for Reduction(LinearLayouts)

df66fe4

lezcano requested a review from ptillet as a code owner January 15, 2025 16:16

lezcano mentioned this pull request Jan 15, 2025

[LinearLayouts] Fix Reduce(LinearEncodingAttr) #5617

Merged

ptillet approved these changes Jan 15, 2025


lezcano commented Jan 15, 2025


lezcano added 2 commits January 15, 2025 16:43

Change the only non-AMD use of getWarpSize and getNumWarpsPerCTA

ee37a63

not this function the other one

3596620

Mogball approved these changes Jan 15, 2025


lezcano enabled auto-merge (squash) January 15, 2025 17:48

lezcano merged commit 9895a1f into main Jan 15, 2025
7 checks passed

lezcano deleted the reviews_reduce_linear branch January 15, 2025 17:59

lezcano mentioned this pull request Jan 23, 2025

[AMD] GetThreadsPerWarpForOperand interface #5675

Merged

This was referenced Jan 30, 2025

[Windows] Fix '__builtin_clz' on windows intel/intel-xpu-backend-for-triton#3312

Merged

Fix __builtin_clz implementation on Windows #5774

Merged
